This example demonstrates how to do a node-based comparison of species distributions, as described in Borregaard, M.K., Rahbek, C., Fjeldså, J., Parra, J.L., Whittaker, R.J. and Graham, C.H. (2014). Node-based analysis of species distributions. Methods in Ecology and Evolution 5: 1225-1235. We will use SpatialEcology functionality and the ecojulia phylogenetics package Phylo to reimplement the method from the paper, from first principles. We start by loading the basic data objects and end up with defining a function with the full functionality of the published paper.
Load data and create objects
First, let's load the data. We have the data in two DataFrames, one of species occurrences in each grid cell, and one with the lat-long coordinates of each grid cell. The DataFrame of occurrences is in the widely used phylocom format, which is a long-form format for associating the occurrence of species in sites. It consists of three columns, a column of species names, one of abundances (here 1, as it's a presence-absence data set) and a column of sites.
using CSV, DataFrames, SpatialEcology
phylocom = CSV.read("../../data/tyrann_phylocom.tsv", DataFrame)
first(phylocom, 4)4 rows × 3 columns
| Plot | Record | Species | |
|---|---|---|---|
| Int64 | Int64 | String6… | |
| 1 | 6504 | 1 | Empidonax_hammondii |
| 2 | 6504 | 1 | Empidonax_alnorum |
| 3 | 6504 | 1 | Sayornis_saya |
| 4 | 6504 | 1 | Contopus_cooperi |
The coordinates is a simple DataFrame with a column of sites, one of latitude and one of longitude
coord = CSV.read("../../data/tyrann_coords.tsv", DataFrame)
first(coord, 4)4 rows × 3 columns
| Long | Lat | cell | |
|---|---|---|---|
| Float64 | Float64 | Int64 | |
| 1 | -156.5 | 71.5 | 6504 |
| 2 | -155.5 | 71.5 | 6505 |
| 3 | -162.5 | 70.5 | 6858 |
| 4 | -161.5 | 70.5 | 6859 |
We ensure that the column of sites are represented as strings in both data sets. We then construct the Assemblage object. The site columns are used to match the two DataFrames together.
phylocom.Plot = string.(phylocom.Plot)
coord.cell = string.(coord.cell)
tyrants = Assemblage(phylocom, coord)Assemblage with 390 species in 3716 sites
Species names:
Empidonax_hammondii, Empidonax_alnorum, Sayornis_saya...Muscisaxicola_capistratus, Neoxolmis_rufiventris
Site names:
6504, 6505, 6858...51588, 51589
Let's have a look at the data
using Plots
ENV["GKSwstype"]="nul" # this is just for the docs to run remotely
default(color = cgrad(:Spectral, rev = true))
plot(tyrants, size = (600, 800))
Next, we'll read in the phylogenetic tree
using Phylo
tree = open(parsenewick, "../../data/tyrannid_tree.tre")
sort!(tree) # sort the nodes on the tree in order of size - useful for plotting
plot(tree, treetype = :fan)
Extract information from a single clade
The Phylo package uses iterators over vertices in the phylogeny for almost everything. For example, to get a vector of all internal (non-tip) nodes in the phylogeny, we would create an iterator over the names of all nodes in the tree, filtered by the function isleaf, which checks whether a node has any descendants, and then collect the iterator to a vector
nodes = nodenamefilter(!isleaf, tree)
nodevec = collect(nodes)
first(nodevec, 4)Let's pick a random node from the vector to demonstrate how we can get information on that node
randnode = nodevec[131]"Node 311"We can get a list of the names of all tips/species descending from the node by getting all descendant nodes with getdescendants and filtering with isleaf. We need to pass an anonymous function to filter here, because the isleaf function takes two arguments.
nodespecies(tree, node) = filter(x -> isleaf(tree, x), getdescendants(tree, node))
first(nodespecies(tree, randnode), 4)We can use that species list to subset an Assemblage object. For instance, we can make a function to create a smaller Assemblage of all species descending from our selected node.
function get_clade(assemblage, tree, node)
specs = nodespecies(tree, node)
view(assemblage, species = specs)
end
rand_clade = get_clade(tyrants, tree, randnode)
plot(rand_clade)
Comparing the richness of sister clades
The question we are interested in addressing here is: At a given node where the lineage splits into two sister clades - are the two descendant clades distributed differently? This could be an indication that an evolutionary or biogeographic event happened at that time, of consequence for the current distribution of the species. So let's get the two descendant clades, and plot their distribution in comparison to the parent clade
function plot_node(assemblage, tree, node)
ch1, ch2 = getchildren(tree, node)[1:2]
assm = get_clade(assemblage, tree, node)
assmch1 = get_clade(assm, tree, ch1)
assmch2 = get_clade(assm, tree, ch2)
plot(
plot(assm), plot(assmch1), plot(assmch2),
layout = (1,3), size = (1200, 400), title = ["parent" "child 1" "child 2"]
)
end
plot_node(tyrants, tree, randnode)
It is clear that the two clades have distinct distributions, with the first descendant appearing to be overrepresented in the tropical rainforest biome, mainly in the Amazon. But is the difference great enough that we can say that this is not just a random pattern? We can use randomization to find out.
Using randomization to assess significance of distribution differences
SpatialEcology Assemblages can be randomized using the curveball matrix randomization algorithm defined in RandomBooleanMatrices.jl. This algorithm randomizes a species-by-site matrix while keeping row and column sums constant, and is very fast. We can instantiate a matrixrandomizer object from our assemblage, and then use this object to repeatedly generate randomized communities
rmg = matrixrandomizer(rand_clade)
newcomm = rand(rmg)
plot(newcomm)
Because row and column sums are kept constant, the richness of the randomized community is the same. But the richness of the two descendant clades will be different - let's look at our focal node
ch1, ch2 = getchildren(tree, randnode)[1:2]
randch1 = get_clade(newcomm, tree, ch1)
randch2 = get_clade(newcomm, tree, ch2)
plot(
plot(rand_clade), plot(randch1), plot(randch2),
layout = (1,3), size = (1200, 400)
)
This represents a random expectation for the species richness of the two descendant clades should be. We can repeat this process 100 times and store the species richness of one of the clades in order to get a sampling distribution. The mean and standard deviation of this distribution can be used to assess how unexpected our empirically observed distribution is. We will focus just on one of the descendants, child clade ch2. The pattern for the other descendant is mirrored.
using Random: rand!
function simulate_descendants(clade, tree, descendant; nsims = 99)
rmg = matrixrandomizer(clade)
ret = zeros(nsims + 1, nsites(clade)) # a matrix to hold the richness values from the simulations
ret[1, :] = richness(get_clade(clade, tree, descendant))
for i in 2:nsims + 1
ret[i, :] .= richness(get_clade(rand!(rmg), tree, descendant))
end
ret
end
sims = simulate_descendants(rand_clade, tree, ch2);100×3716 Array{Float64,2}:
4.0 4.0 5.0 4.0 4.0 4.0 4.0 4.0 … 4.0 6.0 8.0 8.0 4.0 6.0 8.0
3.0 3.0 5.0 1.0 3.0 2.0 3.0 0.0 3.0 3.0 2.0 6.0 2.0 3.0 4.0
1.0 2.0 2.0 3.0 2.0 1.0 4.0 1.0 2.0 4.0 5.0 2.0 1.0 4.0 5.0
2.0 1.0 2.0 2.0 3.0 1.0 2.0 2.0 0.0 3.0 5.0 4.0 1.0 5.0 3.0
0.0 1.0 3.0 2.0 2.0 2.0 1.0 3.0 0.0 2.0 3.0 7.0 1.0 1.0 1.0
2.0 3.0 1.0 2.0 2.0 2.0 3.0 2.0 … 1.0 3.0 3.0 3.0 3.0 2.0 7.0
1.0 3.0 1.0 3.0 3.0 3.0 0.0 2.0 0.0 4.0 4.0 4.0 2.0 2.0 4.0
3.0 3.0 2.0 1.0 2.0 2.0 1.0 0.0 2.0 2.0 1.0 3.0 1.0 2.0 2.0
2.0 4.0 1.0 1.0 3.0 3.0 2.0 2.0 2.0 2.0 5.0 2.0 2.0 4.0 2.0
2.0 2.0 1.0 1.0 2.0 3.0 2.0 1.0 2.0 3.0 6.0 2.0 2.0 3.0 3.0
⋮ ⋮ ⋱ ⋮ ⋮
3.0 1.0 3.0 2.0 0.0 3.0 2.0 3.0 1.0 4.0 5.0 4.0 2.0 3.0 5.0
2.0 0.0 4.0 2.0 1.0 2.0 3.0 3.0 2.0 3.0 1.0 6.0 1.0 3.0 3.0
2.0 2.0 3.0 2.0 2.0 1.0 2.0 0.0 2.0 1.0 4.0 3.0 1.0 3.0 6.0
2.0 1.0 1.0 3.0 0.0 0.0 2.0 2.0 2.0 4.0 5.0 6.0 0.0 2.0 2.0
2.0 2.0 2.0 2.0 0.0 1.0 2.0 3.0 … 3.0 2.0 4.0 5.0 4.0 3.0 5.0
1.0 1.0 3.0 1.0 0.0 1.0 3.0 3.0 1.0 4.0 6.0 5.0 0.0 3.0 3.0
3.0 2.0 4.0 3.0 2.0 2.0 2.0 2.0 2.0 4.0 4.0 4.0 3.0 1.0 1.0
2.0 0.0 1.0 1.0 2.0 0.0 3.0 2.0 3.0 3.0 4.0 3.0 2.0 3.0 6.0
1.0 1.0 2.0 2.0 2.0 3.0 2.0 1.0 0.0 3.0 3.0 5.0 1.0 4.0 4.0Then we calculate the mean and standard deviation across simulations and use this to express the empirical richness values as standardized effect sizes.
using Statistics
function calculate_SOS(sims)
sd = std.(eachcol(sims))
me = mean.(eachcol(sims))
(sims[1, :] .- me) ./ sd
endcalculate_SOS (generic function with 1 method)The resulting standardized effect size for each grid cell constitutes the SOS metric of Borregaard et al. (2014). To calculate this for our focal cell and plot it we can do
sims = simulate_descendants(rand_clade, tree, ch1)
SOS = calculate_SOS(sims)
plot(SOS, rand_clade, clim = (-8,8), fillcolor = :RdYlBu)
We see a clear distinction between the two clades descending from our focal node, where one descendant is overrepresented in tropical moist forest and the other in colder regions.
The strength of divergence among the two clades is summarized by the GND value (Borregaard et al. 2014)
using StatsBase: tiedrank
function calculate_GND(sims)
# two internal convenience functions
logit(p) = log(p/(1-p))
invlogit(p) = exp(p)/(1+exp(p))
n = size(sims, 1)
r = [tiedrank(x)[1]/(n + 1) for x in eachcol(sims)]
p = 1 .- 2 .* abs.(r .- 0.5) .- 1/n
α = mean(logit.(p))
1-invlogit(α)
end
GND = calculate_GND(sims)
first(GND, 4)Applying the method to the entire phylogeny
We can use all of the above to go through the entire phylogeny and generate SOS and GND values. First, let us create a function that calculates both metrics
function process_node(assemblage, tree, nodename; nsims = 100)
clade = get_clade(assemblage, tree, nodename)
children = getchildren(tree, nodename)
if length(children) != 2 || any(x -> isleaf(tree, x) || nspecies(get_clade(assemblage, tree, x)) < 4, children)
return (fill(NaN, nsites(clade)), NaN)
end
sims = simulate_descendants(clade, tree, children[1]; nsims)
calculate_SOS(sims), calculate_GND(sims)
end
SOS, GND = process_node(tyrants, tree, randnode)
plot(SOS, tyrants, clim = (-8,8), fillcolor = :RdYlBu)
Finally, we go through every node on the true and calculate the metrics for them
using ProgressLogging
function node_based_analysis(assemblage::Assemblage, tree::AbstractTree)
nodevec = collect(nodenamefilter(!isleaf, tree))
SOSs = Matrix{Float64}(undef, nsites(tyrants), length(nodevec))
GNDs = Vector{Float64}(undef, length(nodevec))
@progress for (i, node) in enumerate(nodevec)
println(i)
SOSs[:,i], GNDs[i] = process_node(tyrants, tree, node)
end
SOSs, GNDs
end
SOSs, GNDs = node_based_analysis(tyrants, tree);([NaN NaN … NaN -0.34786793645020087; NaN NaN … NaN -0.2918386665664969; … ; NaN NaN … NaN -0.38640804568043124; NaN NaN … NaN -0.6188984718491547], [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN … NaN, 0.7743467401376287, NaN, NaN, 0.2285856668337276, NaN, NaN, NaN, NaN, 0.36270855336718444])Let's visualize the GND values on the tree
plot(tree, showtips = false, marker_z = GNDs, color = cgrad(:YlOrRd, 10, categorical = true))
We notice that a few of the nodes stand clearly out with significant distributional changes